## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## Warning: Removed 8 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
note: Fixed acidity is concentrated between 7 and 8 and is unimodal, with the peak between 7 and 7.5.
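The "Removed ... rows" warnings that accompany these histograms are what ggplot2 emits when axis limits exclude part of the data. The sketch below is a minimal illustration under assumptions: `pf_demo` is a made-up stand-in for the wine data, and the limits and bin width are hypothetical, not the ones used in the original plots.

```r
library(ggplot2)

# Hypothetical stand-in for the wine data: 200 made-up fixed-acidity values.
set.seed(1)
pf_demo <- data.frame(
  fixed.acidity = round(rnorm(200, mean = 8.3, sd = 1.7), 1)
)

# Scale limits discard out-of-range rows before binning, which is what
# triggers the "Removed ... rows" warning from stat_bin.
p_limits <- ggplot(pf_demo, aes(x = fixed.acidity)) +
  geom_histogram(binwidth = 0.5) +
  scale_x_continuous(limits = c(5, 12))

# coord_cartesian() only zooms the view; every row is still binned.
p_zoom <- ggplot(pf_demo, aes(x = fixed.acidity)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(xlim = c(5, 12))

# Number of rows that fall outside the limits used in the first plot.
n_dropped <- sum(pf_demo$fixed.acidity < 5 | pf_demo$fixed.acidity > 12)
```

If the goal is only to hide a long tail, `coord_cartesian()` keeps the bin counts unchanged, whereas scale limits change what is counted.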
## Warning: Removed 21 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
note: Volatile acidity is concentrated between 0.4 and 0.6 and is approximately normally distributed.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
note: Citric acid mostly falls between 0 and 0.7, with no clear distributional shape.
## Warning: Removed 31 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
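note: Residual sugar has a pronounced long tail. Values are concentrated between 1.5 and 3, and the distribution is unimodal, with the highest count at a residual sugar of about 2.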
## Warning: Removed 41 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
note: Chlorides have a pronounced long tail. Values are concentrated between 0.06 and 0.1, and the distribution is unimodal, with the peak at about 0.08.
## Warning: Removed 24 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
note: Free sulfur dioxide above 40 is uncommon. Most values lie between 3 and 18, with the peak at a concentration of about 6.
## Warning: Removed 9 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 1326 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
note: Total sulfur dioxide ranges from 6 to 289. Values above 150 are very rare, and the peak is at about 28.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
note: Combined sulfur dioxide has a pronounced long tail. Most values are below 125, and they are concentrated between 5 and 40.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
note: Density is concentrated between 0.994 and 0.999, unimodal, and approximately normally distributed.
## Warning: Removed 2 rows containing missing values (geom_bar).
note: pH is concentrated between 3 and 3.5, unimodal, and approximately normally distributed.
## Warning: Removed 12 rows containing missing values (geom_bar).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 404 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
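note: Sulphates are clearly long-tailed, with values rarely above 0.93. The bulk of the distribution is unimodal and roughly normal, and most values lie between 0.53 and 0.63.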
## Warning: Removed 2 rows containing missing values (geom_bar).
note: Alcohol content ranges from 8.4 to 14.9. Most values lie between 9.2 and 11, with the highest count at about 9.5.
## Warning: Removed 2 rows containing missing values (geom_bar).
note: Quality scores are concentrated at 5 and 6.
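The dataset describes red wine quality and has 13 columns. The first column is a row index; the remaining columns are:
1. fixed.acidity: fixed acidity (tartaric acid, g/dm^3)
2. volatile.acidity: volatile acidity (acetic acid, g/dm^3)
3. citric.acid: citric acid (g/dm^3)
4. residual.sugar: residual sugar (g/dm^3)
5. chlorides: chlorides (sodium chloride, g/dm^3)
6. free.sulfur.dioxide: free sulfur dioxide (mg/dm^3)
7. total.sulfur.dioxide: total sulfur dioxide (mg/dm^3)
8. density: density (g/cm^3)
9. pH
10. sulphates: sulphates (potassium sulphate, g/dm^3)
11. alcohol: alcohol (% by volume)
Output variable (based on sensory data):
12. quality: quality
Feature of main interest: red wine quality.
Red wine density is related to alcohol content and sugar content.
A new variable, combined sulfur dioxide, was created as total sulfur dioxide minus free sulfur dioxide.
No unusual distributions were found, and the data were not cleaned, adjusted, or transformed.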
##
## Pearson's product-moment correlation
##
## data: pf$fixed.acidity and pf$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
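note: The correlation coefficient between fixed acidity and quality is 0.124, a weak positive correlation: wines with higher fixed acidity tend to have slightly higher quality.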
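Output blocks like the one above are what `cor.test()` prints. As a minimal sketch of where each reported number comes from, the snippet below runs the same test on a hypothetical miniature data frame (`pf_demo`, with made-up values); it is not the wine data and will not reproduce the figures in this report.

```r
# Hypothetical miniature data frame; the values are made up for illustration.
pf_demo <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2, 7.9, 7.3, 8.9, 10.2, 6.7),
  quality       = c(5, 5, 6, 5, 7, 6, 7, 5)
)

# Pearson's product-moment correlation test (the default method of cor.test()).
res <- cor.test(pf_demo$fixed.acidity, pf_demo$quality, method = "pearson")

res$estimate   # sample correlation coefficient ("cor" in the printout)
res$conf.int   # 95 percent confidence interval
res$p.value    # p-value for the null hypothesis of zero correlation
res$parameter  # degrees of freedom, n - 2
```

With 1599 observations, a very small p-value says only that the correlation is unlikely to be exactly zero; the size of the relationship is read from the coefficient and its confidence interval.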
Volatile acidity: too much acetic acid in wine can lead to an unpleasant vinegar taste.
##
## Pearson's product-moment correlation
##
## data: pf$volatile.acidity and pf$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
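note: The scatterplot of volatile acidity against quality and the correlation coefficient of -0.39 show a moderate negative correlation: the lower the volatile acidity, the higher the quality tends to be.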
##
## Pearson's product-moment correlation
##
## data: pf$citric.acid and pf$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
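note: The correlation coefficient between citric acid and quality is 0.226, a weak positive correlation: the higher the citric acid, the higher the quality tends to be.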
##
## Pearson's product-moment correlation
##
## data: pf$residual.sugar and pf$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
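note: The correlation coefficient between residual sugar and quality is 0.0137 (p-value 0.58), so residual sugar has no evident effect on quality.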
## Warning: Removed 41 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: pf$chlorides and pf$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
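note: The correlation coefficient between chlorides and quality is -0.129, a weak negative correlation: the lower the chlorides, the higher the quality tends to be.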
##
## Pearson's product-moment correlation
##
## data: pf$alcohol and pf$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
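note: The correlation coefficient between alcohol and quality is 0.476, the strongest correlation with quality found in this analysis: the higher the alcohol content, the higher the quality tends to be.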
##
## Pearson's product-moment correlation
##
## data: pf$density and pf$quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
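note: The correlation coefficient between density and quality is -0.175, a weak negative correlation: the lower the density, the higher the quality tends to be.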
## Warning: Removed 32 rows containing non-finite values (stat_smooth).
## Warning: Removed 32 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: pf$alcohol and pf$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
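note: The correlation coefficient between alcohol and density is -0.50, a moderately strong negative correlation: the lower the alcohol content, the higher the density.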
##
## Pearson's product-moment correlation
##
## data: pf$pH and pf$quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
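note: The correlation coefficient between pH and quality is -0.058, a very weak correlation.
note: Of the six scatterplots, pH is most clearly related to fixed acidity and citric acid: the higher the fixed acidity and citric acid, the lower the pH.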
##
## Pearson's product-moment correlation
##
## data: pf$fixed.acidity and pf$pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7082857 -0.6559174
## sample estimates:
## cor
## -0.6829782
##
## Pearson's product-moment correlation
##
## data: pf$citric.acid and pf$pH
## t = -25.767, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5756337 -0.5063336
## sample estimates:
## cor
## -0.5419041
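note: The correlation coefficient between fixed acidity and pH is -0.68, and between citric acid and pH it is -0.54. Both are moderate to strong negative correlations: the higher these two acidity measures, the lower the pH.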
##
## Pearson's product-moment correlation
##
## data: pf$free.sulfur.dioxide and pf$quality
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
##
## Pearson's product-moment correlation
##
## data: pf$total.sulfur.dioxide and pf$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
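note: Free sulfur dioxide shows no clear relationship with quality (correlation coefficient -0.051). Total sulfur dioxide is weakly negatively correlated with quality, with a correlation coefficient of -0.185.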
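Analyzing the relationships between pairs of variables led to some interesting findings:
Alcohol is correlated with both quality (0.476) and density (-0.496), while density and quality are only weakly correlated (-0.175). The relationship among these three variables is the focus of the multivariate analysis in the next step.
When exploring how fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, and sulphates relate to pH, the clearest relationships were with fixed acidity and citric acid: the higher they are, the lower the pH. The correlation coefficient between fixed acidity and pH is -0.68 and between citric acid and pH is -0.54, both moderate to strong.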
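Isn't this question a repeat of the previous one? Ordered from the strongest to the weakest correlation with quality (by absolute value of the coefficient), the variables discussed rank as alcohol, volatile acidity, citric acid, density, chlorides, and fixed acidity.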
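A ranking like this can be computed directly instead of being read off individual test outputs. The sketch below assumes a hypothetical miniature data frame `pf_demo` with made-up values and only three predictors; on the real data the same pattern would apply to every numeric column.

```r
# Hypothetical miniature data frame; values are made up for illustration.
pf_demo <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 12.0, 9.5, 13.1, 10.0),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.40, 0.35, 0.66, 0.30, 0.58),
  chlorides        = c(0.076, 0.098, 0.080, 0.075, 0.065, 0.090, 0.060, 0.082),
  quality          = c(5, 5, 6, 6, 7, 5, 8, 5)
)

# Pearson correlation of every predictor with quality.
predictors <- setdiff(names(pf_demo), "quality")
r <- sapply(pf_demo[predictors], function(x) cor(x, pf_demo$quality))

# Order by absolute value so that negative correlations rank by strength too.
ranked <- r[order(abs(r), decreasing = TRUE)]
ranked
```

Sorting on `abs(r)` rather than `r` matters here because several of the relationships with quality are negative.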
I found that fixed acidity and pH have the strongest correlation, with a coefficient of -0.68: the higher the fixed acidity, the lower the pH.
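Define combined sulfur dioxide as combined.sulfur.dioxide = total.sulfur.dioxide - free.sulfur.dioxide
Why can the correlation between combined.sulfur.dioxide and quality not be computed? The error says that "x" and "y" must have the same length.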
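The code that produced this error is not shown, so its cause can only be guessed at. One possibility (an assumption) is that the derived variable was never stored as a column of the data frame: `pf$combined.sulfur.dioxide` then evaluates to `NULL`, which has length 0, and `cor.test()` rejects arguments of unequal length. A minimal sketch of defining the variable as a column, using a hypothetical miniature data frame `pf_demo` with made-up values:

```r
# Hypothetical miniature data frame; values are made up for illustration.
pf_demo <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 13, 9),
  total.sulfur.dioxide = c(34, 67, 54, 60, 40, 18),
  quality              = c(5, 5, 5, 6, 7, 4)
)

# Store the derived variable as a column so it has exactly one value per row.
pf_demo$combined.sulfur.dioxide <-
  pf_demo$total.sulfur.dioxide - pf_demo$free.sulfur.dioxide

# Both arguments now come from the same data frame and have equal length.
stopifnot(length(pf_demo$combined.sulfur.dioxide) == length(pf_demo$quality))
r_combined <- cor(pf_demo$combined.sulfur.dioxide, pf_demo$quality)
r_combined
```

Checking `length()` of both arguments before calling `cor.test()` is a quick way to confirm or rule out this explanation.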
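The dense regions of points in the plot above show that higher-quality wines have lower volatile acidity and higher citric acid.
The plot above shows that higher-quality wines have higher alcohol content, and that higher alcohol content goes with lower density.
Fixed acidity and citric acid are positively correlated: the higher the citric acid, the higher the fixed acidity. Higher-quality wines also tend to have higher citric acid.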
Build linear models from the variables most strongly correlated with quality, with the aim of predicting red wine quality.
##
## Calls:
## m1: lm(formula = I(alcohol) ~ I(volatile.acidity), data = pf)
## m2: lm(formula = I(alcohol) ~ I(volatile.acidity) + citric.acid,
## data = pf)
## m3: lm(formula = I(alcohol) ~ I(volatile.acidity) + citric.acid +
## total.sulfur.dioxide, data = pf)
## m4: lm(formula = I(alcohol) ~ I(volatile.acidity) + citric.acid +
## total.sulfur.dioxide + fixed.acidity, data = pf)
## m5: lm(formula = I(alcohol) ~ I(volatile.acidity) + citric.acid +
## total.sulfur.dioxide + fixed.acidity + chlorides, data = pf)
##
## =========================================================================================
## m1 m2 m3 m4 m5
## -----------------------------------------------------------------------------------------
## (Intercept) 11.058*** 11.067*** 11.241*** 12.243*** 12.532***
## (0.081) (0.125) (0.124) (0.170) (0.167)
## I(volatile.acidity) -1.204*** -1.213*** -1.054*** -0.743*** -0.339*
## (0.146) (0.175) (0.173) (0.173) (0.172)
## citric.acid -0.015 0.103 1.298*** 1.905***
## (0.161) (0.159) (0.210) (0.211)
## total.sulfur.dioxide -0.006*** -0.008*** -0.008***
## (0.001) (0.001) (0.001)
## fixed.acidity -0.171*** -0.192***
## (0.020) (0.020)
## chlorides -5.614***
## (0.542)
## -----------------------------------------------------------------------------------------
## R-squared 0.041 0.041 0.078 0.117 0.173
## N 1599 1599 1599 1599 1599
## =========================================================================================
## Significance: *** = p < 0.001; ** = p < 0.01; * = p < 0.05
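In the regression, citric acid, total sulfur dioxide, fixed acidity, and chlorides are added one at a time. Note that the calls above use alcohol, not quality, as the response, so the fitted m5 equation is
alcohol = 12.532 - 0.339 * volatile.acidity + 1.905 * citric.acid - 0.008 * total.sulfur.dioxide - 0.192 * fixed.acidity - 5.614 * chlorides
Predicting quality would require refitting with quality on the left-hand side of the formula.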
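As a rough sketch of how a model with quality as the response could be built up one predictor at a time, the snippet below uses `lm()` and `update()` on made-up data. `pf_demo`, its coefficients, and the choice of predictors are all hypothetical; the results say nothing about the real wine data.

```r
# Hypothetical stand-in data; values are simulated for illustration only.
set.seed(42)
n <- 60
pf_demo <- data.frame(
  volatile.acidity = runif(n, 0.2, 1.0),
  citric.acid      = runif(n, 0.0, 0.7),
  alcohol          = runif(n, 9, 13)
)
pf_demo$quality <- round(3 + 0.3 * pf_demo$alcohol -
                           1.0 * pf_demo$volatile.acidity +
                           rnorm(n, sd = 0.5))

# The response goes on the left of ~; update() adds one predictor per step.
q1 <- lm(quality ~ alcohol, data = pf_demo)
q2 <- update(q1, . ~ . + volatile.acidity)
q3 <- update(q2, . ~ . + citric.acid)

# R-squared cannot decrease as predictors are added to nested models.
r2 <- sapply(list(q1 = q1, q2 = q2, q3 = q3),
             function(m) summary(m)$r.squared)
r2

coef(q3)  # intercept and slopes of the fitted equation for quality
```

Because R-squared never falls when a predictor is added, a rising value across nested models is not by itself evidence that the extra variable is useful.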
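Volatile acidity, citric acid, and alcohol content are the variables most closely related to red wine quality. According to the project background, higher volatile acidity produces an unpleasant vinegar taste, while higher citric acid adds freshness. In addition, as a matter of general knowledge, wine that is fermented longer has a higher alcohol content, which may also raise its quality. Overall, the results of this analysis agree reasonably well with the background knowledge.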
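A linear model was created (the response in the fitted calls is alcohol): alcohol = 12.532 - 0.339 * volatile.acidity + 1.905 * citric.acid - 0.008 * total.sulfur.dioxide - 0.192 * fixed.acidity - 5.614 * chlorides
Strengths: building on the earlier analysis, the model includes the variables that are more strongly correlated with quality. Limitations: only linear relationships are considered, so the model may be inaccurate; the R-squared of m5 is 0.173.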
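One simple way to probe the linearity limitation is to compare a straight-line fit with one that adds a squared term, using an F-test on the nested models. This is a sketch under assumptions: `pf_demo` is simulated data, and the quadratic term is just one hypothetical alternative among many.

```r
# Hypothetical simulated data; not the wine dataset.
set.seed(7)
n <- 80
pf_demo <- data.frame(alcohol = runif(n, 9, 13))
pf_demo$quality <- 5 + 0.4 * (pf_demo$alcohol - 11) + rnorm(n, sd = 0.6)

# Straight-line model and a nested alternative with a squared term.
linear    <- lm(quality ~ alcohol, data = pf_demo)
quadratic <- lm(quality ~ alcohol + I(alcohol^2), data = pf_demo)

# F-test for whether the squared term improves on the straight line.
cmp <- anova(linear, quadratic)
cmp
```

A large p-value in the second row would give no reason to prefer the curved model; a small one would suggest that the linear form misses some structure.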
## Warning: Ignoring unknown parameters: method
## Warning: Ignoring unknown parameters: method
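Guesses about what affects red wine quality that rest on common sense or experience may well be wrong and need to be checked against data. The more data there is, the more reliable the conclusions.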